127 research outputs found

    Towards Structural Classification of Proteins based on Contact Map Overlap

    A multitude of measures have been proposed to quantify the similarity between protein 3-D structures. Among these measures, contact map overlap (CMO) maximization has received sustained attention during the past decade because it offers a fine estimation of the natural homology relation between proteins. Despite this large involvement of the bioinformatics and computer science community, the performance of known algorithms remains modest. Due to the complexity of the problem, they get stuck on relatively small instances and are not applicable to large-scale comparison. This paper offers a clear improvement over past methods in this respect. We present a new integer programming model for CMO and propose an exact branch-and-bound (B&B) algorithm with bounds computed by solving a Lagrangian relaxation. The efficiency of the approach is demonstrated on a popular small benchmark (Skolnick set, 40 domains). On this set our algorithm significantly outperforms the best existing exact algorithms, and also provides lower and upper bounds of better quality. Some hard CMO instances have been solved for the first time and within reasonable time limits. From the values of the running time and the relative gap (relative difference between upper and lower bounds), we obtained the right classification for this test. These encouraging results led us to design a harder benchmark to better assess the classification capability of our approach. We constructed a large-scale set of 300 protein domains (a subset of the ASTRAL database) that we have called Proteus 300. Using the relative gap of each of the 44850 pairs as a similarity measure, we obtained a classification in very good agreement with SCOP. Our algorithm thus provides a powerful classification tool for large structure databases.
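As a rough illustration of the quantities named in this abstract, the sketch below counts the contacts shared by two contact maps under a fixed alignment and computes a relative gap between bounds. It is not the authors' integer programming model or B&B algorithm: the function names, the data layout, and the choice to normalise the gap by the upper bound are assumptions made for this example.

```python
def contact_map_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that the alignment maps onto contacts of B.

    contacts_a, contacts_b: sets of (i, j) residue index pairs with i < j.
    alignment: dict from residue indices of A to residue indices of B
               (assumed one-to-one; hypothetical layout for this sketch).
    """
    overlap = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            mapped = tuple(sorted((alignment[i], alignment[j])))
            if mapped in contacts_b:
                overlap += 1
    return overlap


def relative_gap(lower, upper):
    """Relative difference between bounds (normalising by upper is an assumption)."""
    return (upper - lower) / upper if upper else 0.0


def number_of_pairs(n):
    """Number of unordered pairs among n domains."""
    return n * (n - 1) // 2
```

With 300 domains, `number_of_pairs(300)` gives the 44850 pairs quoted in the abstract.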

    N–Dimensional Orthogonal Tile Sizing Problem

    AMS subject classification: 68Q22, 90C90. We discuss in this paper the problem of generating highly efficient code when an (n + 1)-dimensional nested loop program is executed on an n-dimensional torus/grid of distributed-memory general-purpose machines. We focus on a class of uniform recurrences with non-negative components of the dependency matrix. Using the strategy of tiling the iteration space, we show that minimizing the total running time reduces to solving a non-trivial non-linear integer optimization problem. For the latter we present a mathematical framework that enables us to derive an O(n log n) algorithm for finding a good approximate solution. The theoretical evaluations and the experimental results show that the obtained solution approximates the original minimum sufficiently well in the context of the considered problem. Such an algorithm is usable in real time for very large values of n and can be used as an optimization technique in parallelizing compilers as well as in performance tuning of parallel codes by hand.

    Solving Maximum Clique Problem for Protein Structure Similarity

    A basic assumption of molecular biology is that proteins sharing close three-dimensional (3D) structures are likely to share a common function and in most cases derive from a common ancestor. Computing the similarity between two protein structures is therefore a crucial task and has been extensively investigated. Evaluating the similarity of two proteins can be done by finding an optimal one-to-one matching between their components, which is equivalent to identifying a maximum weighted clique in a specific "alignment graph". In this paper we present a new integer programming formulation for solving such clique problems. The model has been implemented using the ILOG CPLEX Callable Library. In addition, we designed a dedicated branch and bound algorithm for solving the maximum cardinality clique problem. Both approaches have been integrated in VAST (Vector Alignment Search Tool), a software tool for aligning protein 3D structures that is widely used at NCBI (National Center for Biotechnology Information). The original VAST clique solver uses the well-known Bron and Kerbosch algorithm (BK). Our computational results on real-life protein alignment instances show that our branch and bound algorithm is up to 116 times faster than BK for the largest proteins.
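For readers unfamiliar with the baseline mentioned here, the following is a minimal sketch of the Bron and Kerbosch enumeration with pivoting, used to pick a maximum cardinality clique. It illustrates the BK baseline only, not the authors' branch and bound algorithm or the VAST implementation; the adjacency-dictionary layout and the function names are assumptions of this example.

```python
def bron_kerbosch(adj, r=None, p=None, x=None):
    """Yield every maximal clique of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours.
    r, p, x: current clique, candidate vertices, already processed vertices.
    """
    if r is None:
        r, p, x = set(), set(adj), set()
    if not p and not x:
        yield set(r)
        return
    # Pivot on the vertex with the most neighbours among the candidates,
    # so that fewer branches need to be explored.
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p.remove(v)
        x.add(v)


def maximum_clique(adj):
    """Return one maximum cardinality clique (empty set for an empty graph)."""
    return max(bron_kerbosch(adj), key=len, default=set())
```

Enumerating all maximal cliques is exponential in the worst case, which is consistent with the abstract's motivation for a dedicated exact solver.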

    Integer Programming Approach for Nested Pairs Genome Scaffolding

    The scaffolding step in genome assembly aims to determine the order and the orientation of a huge number of previously assembled genomic fragments (contigs/scaffolds). Here we introduce a particular case of this problem and denote it by Nested Pairs Scaffolding. We formulate it as an optimisation problem and propose an integer programming formulation for its resolution. The computational results on real and synthetic data show excellent behaviour of our formulation.

    Graphes de Chevauchements à destination d'Algorithmes d'Assemblage et de Scaffolding : Retours sur les Paradigmes et Propositions d'Implémentations

    Assembling DNA fragments based on their overlaps remains the main assembly paradigm with long-fragment DNA sequencing technologies, whether the aim is to resolve only one or several haplotypes. Since an overlap can be seen as a succession relationship between two oriented fragments, the directed graph structure has emerged as an appropriate data structure for handling overlaps. However, this graph paradigm does not appear to benefit from the reverse symmetry of the oriented fragments and their overlaps, which is a result of blind DNA double-strand sequencing. Thus, the bi-directed graph paradigm was introduced in 1995 to reduce the graph size by handling the reverse symmetry, and has since become the main graph paradigm used in assembly/scaffolding methods. Nevertheless, the available graph paradigms have never been contrasted before, and no implementations have been described. Here we give a complete review of the existing overlap graph paradigms. Furthermore, we present suitable data structures that are theoretically compared in terms of time and memory consumption in the context of the design of some basic graph algorithms. We also show that each of the paradigms can be switched to another by slightly modifying their data structures.
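The reverse symmetry described in this abstract can be sketched on a plain directed overlap graph: whenever oriented fragment u overlaps oriented fragment v, the reverse of v overlaps the reverse of u. The representation below (tuples of a name and a strand sign, a dictionary of successor sets) is a hypothetical layout chosen for illustration, not one of the data structures proposed in the paper.

```python
def reverse(oriented):
    """Flip the orientation of an oriented fragment given as (name, '+' or '-')."""
    name, strand = oriented
    return (name, '-' if strand == '+' else '+')


def add_overlap(graph, u, v):
    """Record the overlap u -> v together with its mirror reverse(v) -> reverse(u)."""
    graph.setdefault(u, set()).add(v)
    graph.setdefault(reverse(v), set()).add(reverse(u))


def successors(graph, u):
    """Oriented fragments that can follow u according to the recorded overlaps."""
    return graph.get(u, set())
```

Storing both mirrored edges doubles the edge count of the directed graph, which is the redundancy the bi-directed paradigm is said to remove.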

    Lagrangian Approaches for a class of Matching Problems in Computational Biology

    This paper presents efficient algorithms for solving the problem of aligning a protein structure template to a query amino-acid sequence, known as the protein threading problem. We consider the problem as a special case of the graph matching problem. We give formal graph and integer programming models of the problem. After studying the properties of these models, we propose two kinds of Lagrangian relaxation for solving them. We present experimental results on real-life instances showing the efficiency of our approaches.

    Flexible Alignments for Protein Threading

    We present a new local alignment method for the protein threading problem. Local sequence-sequence alignments are widely used to find functionally important regions in families of proteins. However, to the best of our knowledge, no local sequence-structure alignment algorithm has been described in the literature. Here we model local alignments as Mixed Integer Programming (MIP) models. These models make it possible to align a part of a protein structure onto a protein sequence in order to detect local similarities. The paper describes two MIP models, and compares and analyzes their performance using the ILOG CPLEX 10 solver.

    Scaffolding Optimal pour les Régions Répétées Inverses-Complémentaires de Génome de Chloroplastes

    The scaffolding step in genome assembly aims to determine the order and the orientation of a huge number of previously assembled genomic fragments (contigs/scaffolds). Here we introduce a particular case of this problem and denote it by Nested Inverted Fragments Scaffolding (NIFS). We formulate it as an optimisation problem in a particular kind of directed graph that we call the Multiplied Doubled Contigs Graph (MDCG). Furthermore, we prove that the NIFS problem is NP-hard. We also discuss how the chloroplast data have been generated by filtering the reads sequenced from both plants and chloroplasts. Moreover, we propose a graph structure to visualise the solution and to highlight the particularity of the structure of chloroplast regions.

    Optimal de novo assemblies for chloroplast genomes based on inverted repeats patterns

    Background: Chloroplast genome assembly remains challenging because the sequencing step outputs short reads from both the plant and plastid genomes. Some recent dedicated assemblers [1,2] use the information of a highly conserved circular and quadripartite structure with a pair of dispersed inverted repeat regions in chloroplast genomes. Materials and methods: We designed a dedicated pattern-driven de novo assembler which requires only short unpaired reads (distances provided by paired reads are not needed), sequenced from both the plant and its chloroplasts. A first step consists in separating the chloroplast reads from the reads specific to the plant. To this end we use the observation that the chloroplast genomes are over-represented compared to the plant genome. Then we compute an estimated coverage of the pre-assembled contigs and we keep the ones with higher coverage. The first step outputs an assembly graph where each vertex corresponds to a contig and is provided with an estimated multiplicity number. We then use another graph where each vertex is duplicated according to its multiplicity number and to the two possible contig orientations. The edges are duplicated accordingly. In our approach the genome assembly is modelled as finding an elementary path in this graph. We formulate the dispersed repeats as linear constraints and we search for an elementary path using Integer Linear Programming, similarly to [3]. In our approach inverted repeats correspond to occurrences of contigs paired with other occurrences of them but in reverse orientation. Their positions on the assembled sequence must satisfy a nested-pairs pattern. We formulate the above constraints as a linear program where the objective is to maximize the number of nested pairs. Thus, we generalize a similar approach applied to RNA folding [4]. Indeed, in contrast to the latter approach, where the vertices correspond to bases with known sequence indices, in our case the positions of the contigs are variables. Our tool is implemented in Python 3 and uses the open-source PuLP package, which integrates a free solver to solve the above optimization problem. Results: We tested our program with QUAST [5] and we obtained very encouraging preliminary results, with high genome coverage (mostly >99%) and very low mismatch and indel rates. Conclusions: We designed a dedicated pattern-driven de novo assembler for chloroplast genomes using only short unpaired reads. We formulated the conserved circular and quadripartite structure as linear constraints and implemented this model in an open-source program. Finally, the QUAST evaluation returned some encouraging preliminary results.
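Two ingredients of this description can be sketched in a few lines: duplicating each contig vertex by multiplicity and orientation, and checking that a set of position pairs follows a nested (non-crossing) pattern. This is only an illustration under assumed data layouts (a dictionary of multiplicities, pairs of distinct integer positions). It does not reproduce the authors' Integer Linear Programming model, in which the positions are decision variables.

```python
def duplicate_vertices(multiplicities):
    """Build one vertex per (contig, occurrence index, orientation).

    multiplicities: dict mapping a contig name to its estimated multiplicity.
    """
    return [(contig, k, strand)
            for contig, m in multiplicities.items()
            for k in range(m)
            for strand in ('+', '-')]


def are_nested(pairs):
    """Return True if no two position pairs cross each other.

    pairs: iterable of (i, j) positions on the assembled sequence,
           all positions assumed distinct.
    """
    intervals = sorted(tuple(sorted(p)) for p in pairs)
    open_ends = []  # end positions of the intervals still open, innermost last
    for start, end in intervals:
        # Close every interval that finished before this one starts.
        while open_ends and open_ends[-1] < start:
            open_ends.pop()
        # An open interval ending inside the new one means a crossing.
        if open_ends and open_ends[-1] < end:
            return False
        open_ends.append(end)
    return True
```

For example, the pairs (0, 9), (1, 4) and (5, 8) are nested, while (0, 5) and (3, 8) cross.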